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imODUCTION 


The advent of vector processing computer systems has brightened the 
prospects for investigating large hydrodynamics problems which are beyond 
the computational scope of current computers. Problon formulation » particu- 
l8ir3y in the area of data organization, is crucial to the effective use of 
vector processing machines. A representative hydrodynamics problem, the 
shock initiated flow over a flat plate, was used for exploring data organiza- 
tions and program structures needed to exploit the STAR-100 vector processing 
computer. A brief description of the problma is followed by a discussion of 
how each portion of the computational process was vectorized. Finally, timings 
of different portions of the program are compared with equivalent operations 
on serial machines. The speed up of the STAR-100 over the CDC 6600 program 
is shown to increase as the problem size increases. All computations were 
carried out on a CDC 6600 and a CDC STAR 100, with code written in FORTRAN 
for the 6600 and in STAR FORTRAN for the STAR 100. 
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LIST OF SYMBOLS 

dismay variable 
thermal conductivity 

number of grid xxaints in the y direction 
number of grid points in the x direction 
static pressure 
time 

Temperature 

X ccxaponent of velocity 
y component of velocity 
distance parallel to the freestream 
disteuice normal to the freestream 
compression parameter in equation (l) 
spatial increment in x 
spatial increment in Tj 

non-dimensional distance normal to the freestream 

viscosity- 

density 

dummy variable 

maximum length parallel to the freestream 

maximum length nc.-mal to the freestream in the 
transformed coordinate 
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R>ol)l«a Description 

The estahlishment of a test gas flow over a body in an impulse facility 
such as a shock tube, a shock tunnel or an expansion tube is the result of the 
asymptotic relaxation of an initially unsteady flow to a steady state. 

Several aneJ-ytical papers have been written on the establishment of a 
shock-induced laminar boundary-layer on a flat plate (refs. 1, 2, 3). Analyses 
in these publications were based on large freestream unit Reynolds numbers 
which allowed the effects of a boundary-layer induced leading-edge shock and 
its interaction with the boundary layer to be neglected. However, when the 
freestream unit Reynolds number becomes siifficiently small, these effects 
can have an appreciable impact on the development of the shock- induced, as 
well as the steady-state, flow over a flat plate. A study of these leading 
edge effects led to the flow field code described in this paper. 

The effects of shock boundary-layer interaction on the flow over a 
flat plate have been extensively investigated. The remarks found in ref. U 
provide a good, basic discussion of the subject. Briefly, if the freestream 
unit Reynolds number is sufficiently small, the boundary-layer growth at the 
leading edge of the plate will be great enough to make the leading edge appear 
vC the flow as a blunted body thus creating a leading-edge shock whose strength 
decreases with distance from the plate and eventxaally degenerates to a Mach 
wave. At steady state, the flow along the plate can le divided into two 
regions: (l) a merged region near the leading edge where there is no 

separation between the shock and boundary-layer and (2) a weak interaction 
region in which the shock and boundary layer are distinct. 

The dominant features of the unsteady flow field are shown on figure 1. 
Here, a normal shock wave is moving from left to right across the plate into 
a gas at rest while the leading edge shock is beginning to form due to the 
initial boundary layer growth. That portion of the flow near the leading 
edge will be approaching a steady state while the rest of the flow field 
is unsteady. 
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Method of Solution 

The unsteady flow field for this problem con be generated using a finite 
difference representation of the unsteady, compressible, two dimensional 
Savier-Stokes equations. 

A two-step Lax-Wendroff finite difference schane as outlined in ref. 5 
has been used as the algorithm for solving the Navier-Stokes equations. Two 
modifications have been made to the method of ref. 5. First, the transport 
properties - viscosity and thermal conductivity - have been made a function 
of the local temperature. Secondly, a coordinate transformation has been 
made in the y direction. The transformation, in the form 


n = In (3y + l) 


( 1 ) 


allows a larger nodal packing at the wall. Also, for unequal y increments, 
the differencing can be done in equal increments of h which greatly 
simplifies the algorithm. Details of the equations and differencing algo- 
rithm may be found in appendix A. 

The computations have been carried out over a rectangular grid system 
which represents a sharp leading-edge flat plate 15.2^ cm in length and a 
vertical height above the plate of U.67 cm. The computational grid is 
illustrated in figure 2. Nodal points upstream of the plate are initialized 
with post-shock conditions and those on the plate are initialized with the 
unshocked conditions. Symmetry conditions are imposed along the line 
y = 0 for nodes in front of the plate throughout the computations. 

The upper boundary was considered an out flow boiindary with values 
along it being determined by extrapolation from the interior points. The 
right hand boundary was moved downstream as the normal shock moved to the 
right until a maximum grid size of 200 x 28 was reached. Along the plate. 
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the no-slip conditions u “ 0 and v * 0 and the adiahatic wall condition 
^ ® 0 were imposed while the wall density was found from a second order 
extraiK^lation from the interior points. 


Code Vectorization 


System Chare^cteristic s: 

The STAR-100 is a large scale, high speed, logical and arthmetic 
computer utilizing many advanced design features such as vector processing 
and virtual addressing. It also contains vector arithmetic and functional 
units designed for pipt^line operations on a vector, where a vector is 
defined as a set of contiguous elements. The vector instruction performs 
operations on ordered scalars which are elements of the vector. The 
vector instructions read the scalars from consecutive storage locations 
over a specified address range called a field, perform the designated 
operation on each set of operands, and stores the results in consecutive 
addresses of a resultant field beginning at a specified starting address, 
i.e., one vector instruction operates on two vectors and stores a vector 
result. The starting address and vector length are contained in one 6k 
bit word which describes the vector and is called a vector descriptor. 

The STAR instruction set includes, in addition to vector add, sub- 
tract, multiply and divide, other vector instructions that are useful 
when vectorizing this and similar problems. They will be described at 
appropriate points in the text. The equation used to obtain timings of 
STAR vector instructions is of the form 

T - s + ail 

where T is the time in clcchs (one clock equals UO nanoseconds), s is the 
start up time which is different for each vector instruction, a is a 
constant which depends on the particular instruction cmd word length of the 
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operand end H Is the length of the vector. For examplei timing fo:r the 
vector add instruction using 6U hit operands is 


' T * (96 + H/2) clocks. 

For a vector computation, it is more efficient to use one long vector than 
to repeat the computation several times using shorter length vectors. With 
the shorter vector length the vector startup time is multiplied hy the num- 
ber of times the vector instruction is repeated. 

Storage Considerations ; 


As with serial computer systems, the core storage requirements on a 
vector computer must he taken into account. Even with virtual memory it is 
r ?essary to he aware of the storage needed for solving the problem. More 
core storage is needed to effectively use the vector version of the problem 
than is needed if it were run on a serial computer. When a FOF.TPJUI vector 
expression of more than two terms is evaluated, one or more temporary •'■ectors 
is used to store the intermediate result. In addition, some terms need to 
be repeated to form a long vector of the same length as the other computa- 
tion to avoid doing many repeated ccanputations with short vectors. As an 

example, the expression V ^ ■*■••• 

i - 1 ^i i,l i = 1 ®’i i,2 

written as 


should be 


where 


^ 1,1 "" \’^ 1,2 " ^ 2 “ *^ 2 , 1 ^ ® 2 , 2 ”^ 2 * 


J ^ J 

. . ■ <^-1 a.. ,b. . 

j = 1 1-1 i.J i,J 

ecc. The forming of the longer vector can be done by repeated vector transmits 

where the short vector is moved into the longer vector at appropriate loca- 


tions. 

Important factors in vectorizing the code are recognizing where vector 
arithmetic exists and creating temporary vectors where possible. These 
temporary vectors consist of vector expressions which are common to several 
equations. The temporeiry vectors are computed only once, then used in 
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- subse^ieiit calculations. J^ese t vectors cause aore storage to be 

needed, but avoid repeating vector calculations. Equivalencing of scaie 
temporary storage areas is used in the code to minialze the total storage 
needed. 

The present discussion is limited to a iprid size vhere all the 
variables necessaxy for computation will fit in central memory, i.e. 
the total storage required must be less than 500 K words. In this 
problem it is desirable to treat all the points in the whole mesh (grid) 
as one vector. This gives vectors of length t »■ H X N to use in the 
vector computation. These are the maximum length vectors that can be ob- 
tained and will be more exficlent on a vector computer than using vector 
length of M or N. 

Initially, the manner in which the data is to be stored internally in 
the computer must be determined. The option of storing by what will be 
defined as row or column storage is available. Referring to fig. 2 row 
storage means a vector is defined in the X direction. The first element of 
the vector is at the point x « 0, n * 0, the second element of the vector 
is at the point x - ri * 0, etc, until x * x , then the next element 

lUflX 

in the vector is the point x = 0, n = An and continues in this manner for 
the entire grid. Column storage means a vector is defined in the n 
direction. The first element of the vector is at the point x = 0, rj = 0. 

The second element of the vector is at x = 0, n * An etc, until n = 
then the next element in the vector is the point x * Ax, T) = 0 and continues 
in this manner for the entire ^id. 

When choosing the direction of storing the vector, the following were 
considered: 

1) how does the storage affect the boundary computation; 

2) what length vectors can be used; 

3) how does the storage affect increasing the number of grid 
points used during the computation as the right hand boundary 
moves downstream. 
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Row storage means the boundary along and along naO are vectors 

of^lengtir— H— aad-the^undaidLea^j»^d Jse eianpu ted u sing vector computations. 
The boundary at is a straightforward extrapolation while the boundary 

computations at rpO consist of lengthy equations involving m&ny computatiohs. 
The boundary ali^ig x«0 is held constcmt, and the values associated with It 
would be destroyed after each time step if vectors were of length N x M. 

Since the elements along x^O are not stored in contiguous locations « rede* 
fining the elements would req.uire special considerations. Increasing the 
number of grid points at which calculations sore made would not be reasonable 
for row storage, if the entire grid was treated as one vector. This would be 
increasing the length of the vector by inserting eloaents within the vector. 

If vectors of length N-1 were used for all the computations the grid could be 
increased in the x direction by simply increasing the length of the vector 
and the additional points would be the elements at the end of the vector. 

Also the boundary at x®0 would not be disturbed. Therefore, for a reasonable 
vector formulation using row storage, the vector computations would use 
vectors of length N-1 and would be repeated M-2 times (calculations at the 
upper and lower boundary are considered separately). 

Column storage means the boundary along x=0 is a vector. Theso values 
remain constant, so the vector computation can actually start with the 
second column x * Ax. The vectors run in the column direction sc the 
boundary conditions along x=0 are nc.er destroyed. The elements are not stored 
in contiguous locations along the boundaries ri * 0 and » sc they will 

require special consideration. The entire grid can be treated as a vector 
of length M X N and when the numbe. of points in the grid is increased 
it means only increasing the vector length by M elements. This essentially 
added one column to the grid. 

The program was coded using column storage. The deciding factor 
here in using the column storage as defined is not the boundary computation, 
which would probably be more advantageous using row storage, but the value 
of using the longer vectors for the interior computation and the ease of 
adding to the grid size. 
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Interior Point Computations - llhe 'basic finite difference equations used 

pfobXem"T0ligaIgtto n v ec t o ri ze -readily^ ao~alX commit at i ons f or^ ^ : _ _ 

the interior points can use vector Instructions. The difference 
equations are of the forms 

ij; {t + At, x) » ij^(t,x) -" (gl ^ X + dx) - Tlp(t, X - Ax)) 

which consists of a vector subtract, a multiply of a constant, (At/2Ax), 
and a vector, and then a second vector subtract. 

The computation of the interior points begins with the element at the 
point n = An, X * Ax, using the storage defined as column storage and a 
vector length of it = M » (N-l) -2. Using this vector length, all the 
points in the grid except the boundary along x=0, the point n®0, 

X * Ax and the point ^ are computed, see figure 2, This 

ir.eans that incorrect computations are made at the boundaries along t] ** 0 
and In order that a vector calculation of length it could be 

used, these calculations were made even though incorrect resiilts were 
stored. The STAE has control vector capability which prohibits the 
storage of certain elements in a vector. This capability could have 
been used to prevent incorrect results from being stored along the boundaries; 
but in using this featxure, the calculations are still performed even though 
the results are not stored. Since values for these bctindary elements are not 
used before correct values are computed, it did not seem necessary to use 
the control vector feature. After the calculation of the interior points have 
been made, the values at the boimdaries are computed and stored ever the 
incorrect results. 

Boundary Conditions ; 

With the method of storage defined as column storage used for the 
computations along n * 0 and it does not appear t'nat vector 

computations could be used because the elements are not in contiguous 
locations. The boundaries require considerable computations and it -would 


ORIGINAL PA6B O 
OF POOR QUAUTB 
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'b« desirable to use vector ccmputetions . To compute the values along 
n * 0» the values at ri “ ri * and n « 3Ah are used* If the values 

along n * An* Yepresdhted hy- vectors* vec to r 

computations could he made which votild result in a vector whose elements are 
the values of the houndary valties along 

In the evaluation of the houndary c "’ditions extensive ure is made of 
two instructions* transmit indexed list and tranmnit list indexed, each using 
one operand where the elements are not vectors. The transmit indexed list 
transmits noncontiguous information referenced hy an index vector to a Vector* 
This is essentially a gather technique, ^ere information is gathered from 
noncontiguous locations to form a vector, ^e transmit list indexed in* 
struct ion jjerfonas the reverse process, where the elements of a vector are 
treuismitted to noncontiguous locations in memory indicated hy an index vector. 
This is essentially a scatter to meaory. These two instructions enable a 
user to form vectors if they do not exist, use vector arithmetic, then scatter 
the vector result to nonvector storage. 

The vector instruction transmit indexed list is used to take elements 

along n“Aii and form a vector. The same instruction is used to form vectors 

from the elements ri*2Ars and other vectors from the elements along ri=3Ari. 

Once all the vectors needed in the computation are formed, they are used in 

expressions to compute a vector result of the boundary condition. This vector 

is then scattered back to the regular grid where the boundary along rt*0 is 

not a vector. This is accomplished by the transmit list to indexed vector 

which scatters a vector to noncontiguous locations, Along the n*ri 

max 

boundary a similar tecnnique is used where the noncontiguous elements are 
made into a vector. These vectors are used in vector computations to compute 
a vector result which is then scattered back to the noncontiguous locations 
along the boundary. 

Calculations Beguiling Other Vector Instructions ; 

Conditions where two different paths may be followed after a comparison 
has been made on vector elements might be referred to as a vector IF 
statement. Such a situation is encountered when computations involving 
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re»X gas flove ara toade.^ Hare^ sottt piriseierf:, sucli as transport 
properties, are preseriljad l^ ciarve fits over different t«iperature 
ran^^ rather than eai all incittsive analytic function. Th^ vector U?' 
stateaiint does not «Eist in OTiiT^ffilUJ~How^er7"tire - 

can he constructed fros the vector instructions com^pare, c<»^ess and merge. 
The elements of a vector are first conpared with the elemetta of aa©^ 
vector or a constant and a hit vector is the restilt. ©le element# of the 
hit vector are 1 or 0 depending upon i^ether the compar e ccHaditlOh vas 
true or false. This hit vector is then used with the original vector , 
with two congress instructions to form t^ vectors. Separate computations 
are when performed, each using the appropriate vector. The hit vector cem 
then he used to merge the two resultant vectors to form a siz^e resultant 
vector of the original length. 

Results 

The shock induced flow over a flat plate is representative of a class 
of flow field problems for which no solutions have been obtained due to the 
lack of computational resources. 

This program -vrais chosen for coding because: l) the solution requires 

: f'^rating the fvill two-dimensional iavier-Stokes equations; 2) the solution 
allowed a variable computational field size to be used; and 3) the solution 
required the determination of realistic boundary conditions. 

Three versions of the code were written: 1) a scalar version to be 

run on the CDC 6600 serial machine; 2) a fully vectorized version to be 
run on the STAR 100; and 3) a vector version which used scalar calculations 
to determine the boundary vaJ-ues. 

The fully vectorized version of the code accommodated a maximum flow 
field size of 100 x 28 which allowed adequate resolution of the flow. This 
version of the code required a total of 11+ dimensioned vaiiables and a core 
storage size of 260 K as opposed to the serial version of the code which 
required only 25 dimensioned variables and a core storage of 65 K, but used 
disks for storage of most of the data. 

On the STAR 100, two magnetic core storage page sizes are available 
for virtual address references. They are referred to as a large page size. 



searches ^eed to be made to relate a central memory address to a virtual 
memory address on a particular page. When the program was executed using 

o 

large instead of small pages, an improved efficiency of about one-third was 

a 

gained in the CPU time. Thus, large pages have been used in obtaining the 
following results. ^ 

A comparison of the running time for the scalar and fully vectorized r 
versions cf theccode are summarized in table'*!. Resets for both the 
RUN and FTK compilers are given. Computations made with the FTN compiler. 

p “ 

utilized the highest optimization level. Since the ccmputaticnal field 
size is dynamic, the tabulated times shown represent the total CFJ time 
required for the field to reach the given size. The conputaticnal speed 
advantage of the vector system over the serial system is obvious, withw 
the increase in speed directly proportional to the size of the computation 




field, i.e. vector length. 


C9 

A comparison of the time to do identical wcrx for the vector and 

scalar versions of the boundary value computaticns is tabulated ir table 2. 

^ a 

A ten fold, increase in the boundary vector length increases the required CPU 
time by 30% for the vector calculation whereas the identical scalar work 
requires approximately a ten fold increase in the CFJ time. 

Table 3 tabulates a breakdown oftthe total CPU time for the fully vector 
and scalar boundary calculation versions of the code, ror the vector version 
of the code, the total CPU time spent doing the boundary calculations as 
a percentage cf the total CPU time decreases dramatically as the rmber of 
grid points in the computational field becomes large when compared to the , 
number of elements in the boundary vector. On the other hand, the total 

if? 

CPU time required to do the scalar boundary calculations remains at ap- 
proximately 30% cf the total CFJ time. 
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As previously stated, the computation times quoted for the vector machine 
are for a program which resided entirely in control memory. To analyze the 
effect on computational speed of a program which required storage in excess 
of the central memory size, the grid size in the present code was increased 
so the program code and data would not fit in core. Thus, the virtual memory 
capability of the operating system was used to access, as needed, data or 
code which resided outside of central memory. The computational times 
were compared to the time required to do equivalent work using the in-core 
program. Equivalent CPU times were obtained, but the total wall clock time 
for the out-of-core run increased initially even though short vector 
computations were used. The increase was a factor of over 35 after only 200 
time steps. Data were stored and used in such a random manner that the major 
portion of the computation time was used for getting the needed variables 
into core. The increased wall clock time proved this program to be 
impractical for obtaining solutions if the grid size is such that the code 
and data cannot fit in central memory. 

The program has been modified in an attempt to minimize the use of wall 
clock time. The computational grid was broken up into blocks and the compu- 
tation of the grid within each block was computed for a time step. The block 
size was determined and the storage was arranged so that all the variables 
necessary for the computation of the block would fit in memory. At the 
beginning of each block computation, the data for that block would have to 
be accessed initially, but during the computation of the block, no data out- 
side of central memory would have to be accessed. Preliminary runs show that 
executing the modified program in this manner resulted in equivalent CPU 
times, but only a factor of five increase in the wall clock time required for 
identical work using the unmodified in-core program. 

Figures 3 and U show some typical results generated by the code. Figure 

3 is a plot of isotherms in the flow-field at a given instant of time. The 
leading edge of the plate is located at the left hand side of the figure. 

Here, the leading edge shock is located at the isotherm concentration at 


lU 


the right hand side of the figure. The time dependency of the flow is 
illustrated hy transverse velocity profiles for increasing iiines at a given x 


location. 






i CONCLUDING RH4ARKS 




It has been shovn that a flow field prcblem of such size as to be 

a * " 

impractical for routine running on a serial processing machine can be coded 

o 

and run in an efficient manner on a vector processing machine. 

D 

To extract this high performance from a vector processing machine such 

s ° 

as the STAR-100, careful consideration must be given to the formulation and 

r- 1 - - 

flow of the code being written to avoid such obvious inefficiencies as^^ wg^^ 

repeating identical_vector calculations to the more subtle relationship SfiB 

between data storage and boundary calculations. Even when the vector ler.gths 

D o 

used are relatively short, the vectorized version of the code is clearly 



superior to serial code. 

The vectorized code has demonstrated that branch operations 

c 

a 

encountered in imperfect gas calculations can be efficiently vectorized. It 
nas also been shown that the calculation of boundary conditions require 
careful consideration, but can be done efficiently using the vector processor 

When the numberlof boundary points is small, there is no’^apparent 

O 

advantage to vectorizing the boxindary value calculations. However, when 

“ a 

the number of boundary points becomes j.arge, scalar calculations can requirt 
more than 200 percent of the time required to do the identical work 
(including time required to construct the vectors; using vector instructions 
It has been demonstrated that a complex, tvc-dimensional , fluid flow 
problem can be run quickly on a vector processing machine without exceeding 
the core storage requirements of the system. Larger twc-dimensionai and 
axisymmetric flows as well as three-dimensional flows will certainly require 
more mass data storage than can be handled by the system core size. 
Preliminary results have shown that modifications can be made to the current 
vectorized code to handle an out of core size problem and still maintain 
efficient use of a vector computer. 



o 













The^equation (2) is differenced in the following] manner : 
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Table 2.- A Comparlscn cT Identical .-^ork for Scalar vs. 

Vector Boundary Condition Computations. ^ 
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